{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Access a Database with Python - Iris Dataset\n",
    "\n",
    "The Iris dataset is a popular dataset especially in the Machine Learning community, it is a set of features of 50  Iris flowers and their classification into 3 species.\n",
    "It is often used to introduce classification Machine Learning algorithms.\n",
    "\n",
    "First let's download the dataset in `SQLite` format from Kaggle:\n",
    "\n",
    "<https://www.kaggle.com/uciml/iris/>\n",
    "\n",
    "Download `database.sqlite` and save it in the `data/iris` folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p><img   src=\"https://upload.wikimedia.org/wikipedia/commons/4/49/Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg\" alt=\"Iris germanica (Purple bearded Iris), Wakehurst Place, UK - Diliff.jpg\" height=\"145\" width=\"114\"></p>\n",
    "\n",
    "<p><br> From <a href=\"https://commons.wikimedia.org/wiki/File:Iris_germanica_(Purple_bearded_Iris),_Wakehurst_Place,_UK_-_Diliff.jpg#/media/File:Iris_germanica_(Purple_bearded_Iris),_Wakehurst_Place,_UK_-_Diliff.jpg\">Wikimedia</a>, by <a href=\"//commons.wikimedia.org/wiki/User:Diliff\" title=\"User:Diliff\">Diliff</a> - <span class=\"int-own-work\" lang=\"en\">Own work</span>, <a href=\"http://creativecommons.org/licenses/by-sa/3.0\" title=\"Creative Commons Attribution-Share Alike 3.0\">CC BY-SA 3.0</a>, <a href=\"https://commons.wikimedia.org/w/index.php?curid=33037509\">Link</a></p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's check that the sqlite database is available and display an error message if the file is not available (`assert` checks if the expression is `True`, otherwise throws `AssertionError` with the error message string provided):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "data_iris_folder_content = os.listdir(\"data/iris\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "error_message = \"Error: sqlite file not available, check instructions above to download it\"\n",
    "assert \"database.sqlite\" in data_iris_folder_content, error_message"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Access the Database with the sqlite3 Package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `sqlite3` package from the Python standard library to connect to the `sqlite` database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sqlite3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "conn = sqlite3.connect('data/iris/database.sqlite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "type(cursor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `sqlite3.Cursor` object is our interface to the database, mostly throught the `execute` method that allows to run any `SQL` query on our database.\n",
    "\n",
    "First of all we can get a list of all the tables saved into the database, this is done by reading the column `name` from the `sqlite_master` metadata table with:\n",
    "\n",
    "    SELECT name FROM sqlite_master\n",
    "    \n",
    "The output of the `execute` method is an iterator that can be used in a `for` loop to print the value of each row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for row in cursor.execute(\"SELECT name FROM sqlite_master\"):\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "a shortcut to directly execute the query and gather the results is the `fetchall` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cursor.execute(\"SELECT name FROM sqlite_master\").fetchall()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Notice**: this way of finding the available tables in a database is specific to `sqlite`, other databases like `MySQL` or `PostgreSQL` have different syntax."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can execute standard `SQL` query on the database, `SQL` is a language designed to interact with data stored in a relational database. It has a standard specification, therefore the commands below work on any database.\n",
    "\n",
    "If you need to connect to another database, you would use another package instead of `sqlite3`, for example:\n",
    "\n",
    "* [MySQL Connector](https://dev.mysql.com/doc/connector-python/en/) for MySQL\n",
    "* [Psycopg](http://initd.org/psycopg/docs/install.html) for PostgreSQL\n",
    "* [pymssql](http://pymssql.org/en/stable/) for Microsoft MS SQL\n",
    "\n",
    "then you would connect to the database using specific host, port and authentication credentials but then you could execute the same exact `SQL` statements.\n",
    "\n",
    "Let's take a look for example at the first 3 rows in the Iris table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sample_data = cursor.execute(\"SELECT * FROM Iris LIMIT 20\").fetchall()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(type(sample_data))\n",
    "sample_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "[row[0] for row in cursor.description]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is evident that the interface provided by `sqlite3` is low-level, for data exploration purposes we would like to directly import data into a more user friendly library like `pandas`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import data from a database to `pandas`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iris_data = pd.read_sql_query(\"SELECT * FROM Iris\", conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas.read_sql_query` takes a `SQL` query and a connection object and imports the data into a `DataFrame`, also keeping the same data types of the database columns. `pandas` provides a lot of the same functionality of `SQL` with a more user-friendly interface.\n",
    "\n",
    "However, `sqlite3` is extremely useful for downselecting data **before** importing them in `pandas`.\n",
    "\n",
    "For example you might have 1 TB of data in a table stored in a database on a server machine. You are interested in working on a subset of the data based on some criterion, unfortunately it would be impossible to first load data into `pandas` and then filter them, therefore we should tell the database to perform the filtering and just load into `pandas` the downsized dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "iris_setosa_data = pd.read_sql_query(\"SELECT * FROM Iris WHERE Species == 'Iris-setosa'\", conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris_setosa_data\n",
    "print(iris_setosa_data.shape)\n",
    "print(iris_data.shape)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
